Combined mRNAs and clinical factors model on predicting prognosis in patients with triple-negative breast cancer

Objective Triple-negative breast cancer (TNBC) is aggressive cancer usually diagnosed in young women with no effective prognosis prediction model to use. The present study was performed to develop a useful prognostic model for predicting overall survival (OS) for TNBC patients. Methods The Cancer Genome Atlas (TCGA) and Molecular Taxonomy of Breast Cancer International Consortium (METABRIC) databases were used as training and validation data sets, respectively, in which the gene expression levels and clinical prognostic information of TNBC were collected. Differentially expressed genes (DEGs) between TNBC and non-TNBC (NTNBC) were identified with the thresholds of false discovery rate < 0.05 and |log2 Fold Change| > 1. DEGs in AmiGO2 and the Kyoto Encyclopedia of Genes and Genomes (KEGG) databases were retained for further study. Univariate, multivariate Cox, and logistic regression analysis were conducted for detecting DEG signature with the threshold of log-rank P < 0.05. The prognosis models of mRNA signature, clinical factors were constructed and compared. Results One five-DEG signature, including CHST4, COCH, CST9, SOX11, and TDGF1 was identified in DEG prognosis model. Stratified analysis showed that the patients aged over 60, with higher pathologic stage (III-IV) and recurrence induced a significantly lower survival rate than those aged below 60, lower pathologic stage and without recurrence. Compared with patients with low-risk scores, those presented high-risk scores demonstrated significantly lower survival rate in the subgroup aged over 60 [HR = 3.780 (1.801–7.933), P < 0.0001]. For patients who obtained a higher pathologic stage and recurrence, high-risk scores were correlated with a significantly lower survival rate than patients with low-risk scores. The five-mRNA signature combined with clinical model (AUC = 0.950) predicted better than single clinical model (AUC = 0.795) or five-mRNA signature model (AUC = 0.823). Conclusion Our present study identified a prognostic prediction model (combined with five-mRNA signature and clinical factors) for TNBC patients receiving immunotherapy, which will benefit future research and clinical therapies.


Introduction
Triple-negative breast cancer (TNBC) is an aggressive breast cancer negative for progesterone receptor (PR), estrogen receptor (ER), and human epidermal growth factor receptor 2 (Her-2). As known, TNBC patients benefit little from both endocrine therapy and Her-2 targeted therapy, but chemotherapy, the traditional treatment system [1]. Even worse, TNBC patients suffer from earlier recurrence, worse prognosis, and shorter survival time than other breast cancer subtypes [2][3][4]. Nowadays, immunotherapy is drawing extensive attention in TNBC tumor therapy [5][6][7][8]. The efficacy of immunotherapy in most types of breast cancer has not been confirmed when compared with other cancers with higher immunogenicity as malignant melanoma, pulmonary small cell lung cancer [9,10]. However, the TNBC immunotherapy approved by FDA in the United States has obtained outstanding curative effects [11]. Although the success of immunotherapy is exciting, countless patients did not respond to immunotherapy.
The complexity and diversity of the tumor microenvironment (TME) had been gradually understood in recent years. Moreover, its importance in immunotherapy has also been realized. TME was the cellular environment, including immune cells, mesenchymal cells, endothelial cells, inflammatory mediators, and extracellular matrix molecules. It is widely considered that the microenvironment plays a significant role in tumor development [12]. Therefore, the comprehensive analysis of the correlation between gene signatures and overall survival (OS) may shed light on the pathogenesis of TNBC.
Chemotherapy remains the most effective therapy method [13]. In recent years, immunotherapy had been widely studied in cancers, especially TNBC. PD-1/PD-L1 are a pair of immune co-stimulatory molecules contain the medicines of pembrolizumab [14], atezolizumab [15], and durvalumab [16], which were reported to be effective for prolonging OS. Clinical studies have found that immune infiltration could improve prognosis in TNBC patients [17,18]. Therefore, the identification of DEGs of TNBC may contribute to the in-depth analysis of factors affecting survival.
Traditionally, the prognosis was predicted by means of clinical risk factors, including age, tumor size, pathologic stage, and location. With the development of high-throughput sequencing technologies, multigene signatures including miRNA-signature, mRNA-signature, and lncRNA-signature were recognized as valuable biomarkers on predicting breast cancer prognosis, such as Oncotype DX, B-cell/IL8, Mammo-print, and Genomic Grading Index [19][20][21]. In recent years, an increasing number of studies demonstrated that mRNAs play identification, biomarker, and prognosis prediction roles in TNBC patients [22][23][24][25][26][27]. However, there was still no mRNA signature associated with prognosis prediction in TNBC patients ever reported.
The Cancer Genome Atlas (TCGA) database and Molecular Taxonomy of Breast Cancer International Consortium (METABRIC) database provide a wealth of available cohorts about cancer-specific gene expression information and detailed clinical characteristics. They accelerated the molecular analysis of cancers. In the present study, the mRNA signature of TNBC was identified by the usage of four databases, including TCGA, METABRIC, AmiGO2, and the Kyoto Encyclopedia of Genes and Genomes (KEGG). Clinical factor and mRNA signature prognosis prediction models were made and compared. The stratified analysis was performed to predict a more accurate prognosis situation. Here, our objective was to find the most accurate and straightforward prognostic model used in the clinical work of TNBC.
In the present study, a five-mRNA signature (CHST4, COCH, CST9, SOX11, and TDGF1) was constructed to predict the prognosis of TNBC. Moreover, the combined prognosis model of mRNA signature and clinical factors have a better prediction function than a single mRNA signature model or clinical factors.

Clinical information and RNA-Seq dataset in TCGA data set
We downloaded the level three fragments per kilobase million (FPKM) gene-level RNA-Seq data of breast cancer samples produced by Illumina HiSeq 2000 RNA sequencing platform from the TCGA database as a training data set through Genomic Data Commons (GDC) Data Transfer tool (https://portal.gdc.cancer.gov/) before February 01, 2020. The patients with documented complete expression profiles and clinical information, including OS, ER, PR, and Her-2 were selected in the study. This study met the publication guidelines provided by TCGA. At the same time, the gene expression and clinical information of breast cancer samples were downloaded from the METABRIC database (http://molonc.bccrc.ca/) as a validation data set.

DEGs signature identification and survival prognosis models construction
Univariate and multivariate Cox regression analyses were performed to identify the independent prognosis related DEGs based on the clinical prognosis information (recurrence, dead, OS time) of the samples in the TCGA training data set and gene expression levels of DEGs identified above, using the survival package (version 2.41-1, http://cran.r-project.org/web/ packages/survival/index.html) of R3.4.1 (S1 File). Log-rank P < 0.05 was considered as the threshold for a significant correlation. Then, the feature DEGs among the above DEGs were screened out using the Logit regression model by glm function in R3.4.1 (S1 File) [31,32].
The DEG prognostic risk score was calculated based on the expression levels of feature DEGs obtained in the previous steps and the prognostic coefficients of each element in the optimized combination of DEGs. The DEGs prognostic model was calculated as follows: Here β DEGs indicated the coefficient of DEGs derived from multivariate Cox regression, whereas Exp DEGs represents the expression level of the target DEGs in the training data set.
Finally, the DEG prognostic risk score of each sample was evaluated. Taking medium value as the threshold, samples were divided into high-and low-risk groups in the TCGA training data set. We compared the real survival prognosis with the grouped samples by DEGs prognostic score model using survival package version 2.41-1 Kaplan-Meier curve method in R3.4.1 (S1 File).

DEGs prognostic risk model validation analysis
The samples in the TCGA and METABRIC validation data set were both used to validate the performance of this DEG prognostic risk prediction model. The samples in the TCGA training and METABRIC verification data sets were separately classified into TNBC and NTNBC sampling groups by the logistic regression model. The DEGs with P < 0.05 were considered as feature DEGs. Then, the accuracy of classification was validated by comparing the predicted group with the actual group.

Screening for independent prognostic clinical models
The independent prognostic clinical factors in the breast cancer tumor samples of the TCGA training data set were screened using the univariate and multivariate Cox regression analysis in R3.4.1 survival package (version 2.41-1; S1 File). Log-rank P < 0.05 was used as the threshold. Then, the stratified analysis was performed.

Comparison analysis between prognostic clinical factors and DEGs prognostic risk models
To further investigate the correlation of the independent prognostic clinical factors and DEG prognostic risk model, the nomogram analyses of three-and five-year survival rate prediction models were performed. rms package version 5.1-2 of R3.4.1 (S1 File; https://cran.r-project. org/web/packages/rms/index.html) was used [33,34]. Then, the clinical prognosis models were constructed and compared with the DEGs prognostic model.

Data source and preprocessing
Overall, 1,217 samples were assessed, with 710 samples (116 TNBC and 594 NTNBC) with integral clinical information in TCGA training data set were retained for further study. The clinical information statistics are shown in Table 1 and S1 Table. Simultaneously, 299 TNBC and 1,605 NTNBC samples with RNA-Seq expression profiles from the METABRIC database was downloaded as the validation data set.
A total of 15,583 mRNAs were identified in TCGA training data set by removing those with an expression level of 0. With the thresholds of FDR < 0.05 and |log 2 FC| > 1, 884 DEGs, with 578 up-regulated and 306 down-regulated were identified ( Fig 1A; S2 Table). Moreover, the two-dimensional hierarchical clustering heatmap of these DEGs presented an obvious classification on TNBC and NTNBC patients (Fig 1B). A total of 164 DEGs involved in both AmiGO2 and KEGG databases were used for further study (S3 Table).

Screening of characteristic DEGs and constructing of survival prognosis models
In the TCGA training data set, twenty prognosis-related DEGs were selected through univariate Cox regression analysis (S4 Table), seven of which were left after multivariate Cox regression analysis, including cochlin (COCH), carbohydrate sulfotransferase 4 (CHST4), LIM  domain only 1 (LMO1), cystatin 9 (CST9), SRY-box transcription factor 11 (SOX11), histatin 3 (HTN3), and teratocarcinoma-derived growth factor 1 (TDGF1) (S5 Table). Thereafter, a five-DEG signature associated with independent prognosis was screened out and used for the Logit regression model construction.  Table 2. Then, the DEG prognostic risk model was constructed based on the expression profiles of CHST4, COCH, CST9, SOX11, and TDGF1 in the TCGA training data set. The prognostic risk score of each TCGA sample as calculated as: prognostic risk score = (-0.064 According to the median value of risk scores, samples in each data set were divided into the high-and low-risk groups. Significant differences were observed between high-and low-risk groups in TCGA training data set [HR = 2.509 (1.570-4.012), P < 0.0001; Fig 2A]. The patients in the high-and low-risk groups in the METABRIC validation data set also had a difference survival ratio [HR = 1.234 (1.096-1.389), P < 0.0001; Fig 2B]. The receiver operating characteristic (ROC) curve was plotted, with the AUC of 0.823 (95% CI: 0.675-0.956; Fig 2C) in the training and 0.642 (95% CI: 0.605-0.617; Fig 2D) in the validation data sets.

DEGs prognostic risk model validation analysis
The five-DEGs signature was employed on the classification of TNBC and NTNBC samples in the TCGA training and the METABRIC validation data sets. The scatter distribution map revealed that the five-DEG signature could precisely identified TNBC and NTNBC samples both in the TCGA training (AUC = 0.938, 95% CI: 0.897-0.921; Fig 3A and 3C) and the METABRIC validation data sets (AUC = 0.831, 95% CI: 0.749-0.864; Fig 3B and 3D). When compared the predicted classification with the actual group, the overall TNBC predictive percent of this model was 94.08% and 92.91% in the TCGA training and METABRIC validation data sets, respectively (Fig 3E and 3F).

Screening for independent prognostic clinical models in the TCGA training cohort
In the TCGA training set, several clinical independent factors significantly associated with OS were screened out, including the age [HR = 1.  (Table 3). Then, we stratified samples into subgroups according to age (> 60-year-old   Fig 4G] induced a significantly lower survival rate than those were younger, at lower pathologic stage and without recurrence. OS showed no significant differences between the high-and low-risk groups for patients aged under 60, with early pathologic stage, and without recurrence (P > 0.05; Fig 4B,

Comparison analysis between prognostic clinical factors and DEGs prognostic risk models
The survival nomogram model analysis performed in the TCGA training data set samples revealed that age and DEG prognostic contributed most to the three-year and five-year OS (Fig 5A). The predictive three-year (C-index = 0.872) and five-year (C-index = 0.856) survival probability based on the model was basically in line with the actual survival rates (Fig 5B).

Discussion
TNBC is a heterogeneous and aggressive disease with short treatment-to-relapse time and a high rate of visceral metastasis [35]. Its long-term prognosis is worse than other breast cancer subtypes [36]. Worse, TNBC recurrence is known to happen within the first three years [37] after therapy. Once recurrence and metastasis occur, the median survival time is less than one year [38]. Chemotherapy remains the most effective therapy method [13]. In recent years, immunotherapy had been widely studied in cancers, especially TNBC. PD-1/PD-L1 are a pair of immune co-stimulatory molecules contain the medicines of pembrolizumab [14], atezolizumab [15], and durvalumab [16], which were reported to be effective for prolonging OS. Clinical studies have found that immune infiltration could improve prognosis in TNBC patients [17,18]. Therefore, the identification of DEGs of TNBC may contribute to the in-depth analysis of factors affecting survival. Databases of TCGA, METABRIC, AmiGO 2, and KEGG  provide massive comprehensive and reliable high-throughput sequences for mining. In the present study, a five-DEG signature (CHST4, COCH, CST9, SOX11, and TDGF1) was constructed to predict the prognosis of TNBC. Moreover, the combined prognosis model of DEG signature and clinical factors have a better prediction function than a single DEG signature model or clinical factors. The signature in our present study consisted of five genes, including CHST4, COCH, CST9, SOX11, and TDGF1. CHST4, enriched in the biological process of immune response, was responsive to any potential internal or invasive threat of an organism [39]. COCH, enriched in the biological process of regulating innate immune response positively, is the first line of defense against infection through activating and increasing the frequency, rate, and extent of the innate immune response. CST9 was enriched in the immune response against microbes mediated through body fluid. SOX11 was enriched in the biological process of negatively regulation of lymphocyte proliferation by stopping, preventing, and reducing the rate or extent of  lymphocyte proliferation. TDGF1 was enriched in the GO term of cellular response to interferon-gamma. These genes were involved in immune-related GO terms and play essential roles in immune regulation. However, the mechanism of them in TNBC patients still need further analysis. In recent years, the association between mRNAs, lncRNAs, and the prognosis of TNBC had been widely studied. In the study of Jiang et al. [26], one lncRNA-mRNA signature was developed to predict taxane chemotherapy beneficiation and recurrence in TNBC and assist individual frame treatment for TNBC patients. Loibl et al. [40] predicted an immune-related mRNA signature TIL and IFN-γ to respond durvalumab in primary TNBC. Ren et al. [41] reported a seven-gene signature (1.108 � TMEM101-0.213 � KRT5-0.315 � ACAN-0.464 � LCA5+-0.446 � RPP40-0.373 � LAGE3-0.257 � CDKL2) on predicting prognostic and de-escalating treatment for early-stage TNBC. In the study of Wang et al. [23], they developed a response score with one lncRNA and two mRNAs (2.595 � BPESC1-1.09 � WDR72-1.428 � GADD45A-0.731), which could be employed clinically to predict complete pathological responses in TNBC patients receiving neoadjuvant chemotherapy. Integrated signature of three DE mRNAs (TRIM59, EX01, and RAD51AP1), one DE lncRNA (KIRREL3-AS1), and one DE miRNA (hsa-mir-106a) was found to be significantly associated with the prognosis of patients with TNBC [42]. However, the application of these molecular signatures was limited because of the clinical factors and more attention on molecular factors. In this study, the prognostic risk was predicted by a single clinical factor, five-mRNA signature, and combination. The predictive performance of our combined prognostic model was found to be superior to those of the five-mRNA signature, age, stage, recurrence, and clinical alone. Prognostic models should be as simple as possible for patients and doctors to utilize in clinical practice. Moreover, it should be accurate. Our combined prognosis prediction model was based on routine factors, including genetic differences (the five-mRNA signature), baseline demographic factor (age), histopathological characteristic (pathologic stage), and prognostic factor (recurrence). The combined essential factors make the association between risk factors and outcomes more accurate. Consequently, the prognosis risk of TNBC patients can be easily estimated.
A good prognosis prediction model has always been made by considering stratified clinical factors, in which the patients were divided into high and low-risk groups. The prognosis prediction model with stratified factors was more accurate than without. The five-DEG signature performed well on risk stratification in subgroups of pathologic stage III-IV, with recurrence, and those aged over 60. As for the impact of age on prognostic, Liedtke et al. [43] demonstrated that patients � 40 years old have poorer survival despite more aggressive systemic therapy, which was 20-years earlier than that demonstrated in our present study. The result might clarify a more accurate signature in our study. As for the pathologic stage, He et al. [44] reported that higher stage correlated with longer disease-free survival (HR = 3.13, 95% CI: 1.94-5.06), which was consistent with the result in our study (HR = 2.873, 95% CI: 1.830-4.511). However, our present study was more comprehensive, which consisted of the stage of IV. Moreover, the predictive capacity of factors, including five-DEG signature, age, pathologic stage, and recurrence, were independent of each other.
There are some limitations in the present study. Firstly, we only detected the model but employed in the actual clinical work. Secondly, these genes needed to be further elucidated on the mechanism and functions of TNBC.

Conclusions
Our present study identified a TNBC prognostic prediction model with five-DEG signature and clinical factors, which will benefit for future research and clinical therapies.
Supporting information S1 File. The R code of the whole study. (DOCX) S1 Table. The clinical information statistics. The clinical information of overall survival time, age, stage, radiation, targeted-therapy, and recurrence were provided. (XLSX) S2 Table. The list of differential expressed genes. The thresholds were set as false discovery rate (FDR) < 0.05 and |log 2 Fold Change| > 1.